Scalable data visualisation via grammar of graphics

Ben Lambert

Grammar of graphics

  • single framework that easily generalises to larger sets of variables with complex hierarchies
  • framework for a huge variety of graphics
  • agile exploration of data and creation of coherent graphics

What this course is not

[image]

What is it then?

  • Personal view on why grammar of graphics dominates for most datasets
  • Attempt to entice people to find out more for themselves

Why don’t more people use it?

In my experience, people:

  • lack exposure to it
  • find the syntax confusing and “overly complex”
  • don’t see the benefits over “traditional” data plotting
  • are busy

Course structure

  • Short lectures
  • Problem set: answers in R and Python

Suicide rates: 1985 to 2016

country year age sex suicides_no population
Bahrain 1985 25-34 years male 7 67600
Bahrain 1985 35-54 years male 3 49700
Bahrain 1985 35-54 years female 1 26900
Bahrain 1985 15-24 years female 0 37800
Bahrain 1985 25-34 years female 0 27600
Bahrain 1985 5-14 years female 0 41400
Bahrain 1985 55-74 years female 0 8700
Bahrain 1985 75+ years female 0 1500
Bahrain 1985 15-24 years male 0 49700
Bahrain 1985 5-14 years male 0 42000
Bahrain 1985 55-74 years male 0 11600
Bahrain 1985 75+ years male 0 1400
Bahrain 1987 55-74 years male 1 11800
Bahrain 1987 35-54 years male 3 47700
Bahrain 1987 25-34 years male 3 63900
Bahrain 1987 25-34 years female 1 27500
Bahrain 1987 15-24 years female 1 38600
Bahrain 1987 35-54 years female 0 27300
Bahrain 1987 5-14 years female 0 42500
Bahrain 1987 55-74 years female 0 8900
Bahrain 1987 75+ years female 0 1500
Bahrain 1987 15-24 years male 0 49400
Bahrain 1987 5-14 years male 0 43200
Bahrain 1987 75+ years male 0 1600
Bahrain 1988 25-34 years male 8 61600
Bahrain 1988 35-54 years male 4 63500
Bahrain 1988 35-54 years female 1 30900
Bahrain 1988 15-24 years male 1 44100
Bahrain 1988 15-24 years female 0 38300
Bahrain 1988 25-34 years female 0 32000
Bahrain 1988 5-14 years female 0 45400
Bahrain 1988 55-74 years female 0 10300
Bahrain 1988 75+ years female 0 1400
Bahrain 1988 5-14 years male 0 46500
Bahrain 1988 55-74 years male 0 14400
Bahrain 1988 75+ years male 0 1300
Bahrain 1997 15-24 years female 0 46573
Bahrain 1997 25-34 years female 0 50841
Bahrain 1997 35-54 years female 0 53562
Bahrain 1997 5-14 years female 0 55390
Bahrain 1997 55-74 years female 0 13826
Bahrain 1997 75+ years female 0 2250
Bahrain 1997 15-24 years male 0 54369
Bahrain 1997 25-34 years male 0 80297
Bahrain 1997 35-54 years male 0 96825
Bahrain 1997 5-14 years male 0 57431
Bahrain 1997 55-74 years male 0 16035
Bahrain 1997 75+ years male 0 2355
Bahrain 1998 15-24 years female 0 48147
Bahrain 1998 25-34 years female 0 51596
Bahrain 1998 35-54 years female 0 56917
Bahrain 1998 5-14 years female 0 56626
Bahrain 1998 55-74 years female 0 14404
Bahrain 1998 75+ years female 0 2312
Bahrain 1998 15-24 years male 0 56377
Bahrain 1998 25-34 years male 0 80849
Bahrain 1998 35-54 years male 0 101013
Bahrain 1998 5-14 years male 0 58935
Bahrain 1998 55-74 years male 0 16409
Bahrain 1998 75+ years male 0 2423
Bahrain 1999 25-34 years male 8 82791
Bahrain 1999 35-54 years male 4 105903
Bahrain 1999 15-24 years male 2 57914
Bahrain 1999 15-24 years female 1 49631
Bahrain 1999 25-34 years female 1 52339
Bahrain 1999 35-54 years female 1 60568
Bahrain 1999 5-14 years female 0 58374
Bahrain 1999 55-74 years female 0 15017
Bahrain 1999 75+ years female 0 2339
Bahrain 1999 5-14 years male 0 61554
Bahrain 1999 55-74 years male 0 16810
Bahrain 1999 75+ years male 0 2522
Bahrain 2000 25-34 years male 16 86231
Bahrain 2000 35-54 years male 10 112060
Bahrain 2000 15-24 years male 2 59208
Bahrain 2000 25-34 years female 1 53263
Bahrain 2000 15-24 years female 0 50968
Bahrain 2000 35-54 years female 0 64617
Bahrain 2000 5-14 years female 0 60791
Bahrain 2000 55-74 years female 0 15556
Bahrain 2000 75+ years female 0 2394
Bahrain 2000 5-14 years male 0 65512
Bahrain 2000 55-74 years male 0 17346
Bahrain 2000 75+ years male 0 2648
Bahrain 2001 25-34 years male 8 90920
Bahrain 2001 35-54 years male 8 118151
Bahrain 2001 55-74 years male 1 17917
Bahrain 2001 15-24 years male 2 68309
Bahrain 2001 15-24 years female 1 54497
Bahrain 2001 25-34 years female 1 55961
Bahrain 2001 5-14 years male 1 67127
Bahrain 2001 35-54 years female 1 67856
Bahrain 2001 5-14 years female 0 62316
Bahrain 2001 55-74 years female 0 15791
Bahrain 2001 75+ years female 0 2544
Bahrain 2001 75+ years male 0 2694
Bahrain 2002 25-34 years male 7 97288
Bahrain 2002 35-54 years male 9 126500
Bahrain 2002 35-54 years female 4 71376
Bahrain 2002 15-24 years female 2 57599

Comparing the traditional and GG ways

Standard way

plot(df$population, df$suicides_no)

Plotting with ggplot / plotnine

ggplot(df, aes(x=population, y=suicides_no)) +
  geom_point()

How to colour points according to sex?

Colouring by sex the traditional way (uses “wide” format)

country year age population_male population_female suicides_no_male suicides_no_female
Bahrain 1985 25-34 years 67600 27600 7 0
Bahrain 1985 35-54 years 49700 26900 3 1
Bahrain 1985 15-24 years 49700 37800 0 0
Bahrain 1985 5-14 years 42000 41400 0 0
Bahrain 1985 55-74 years 11600 8700 0 0
Bahrain 1985 75+ years 1400 1500 0 0
Bahrain 1987 55-74 years 11800 8900 1 0
Bahrain 1987 35-54 years 47700 27300 3 0
Bahrain 1987 25-34 years 63900 27500 3 1
Bahrain 1987 15-24 years 49400 38600 0 1
Bahrain 1987 5-14 years 43200 42500 0 0
Bahrain 1987 75+ years 1600 1500 0 0
Bahrain 1988 25-34 years 61600 32000 8 0
Bahrain 1988 35-54 years 63500 30900 4 1
Bahrain 1988 15-24 years 44100 38300 1 0
Bahrain 1988 5-14 years 46500 45400 0 0
Bahrain 1988 55-74 years 14400 10300 0 0
Bahrain 1988 75+ years 1300 1400 0 0
Bahrain 1997 15-24 years 54369 46573 0 0
Bahrain 1997 25-34 years 80297 50841 0 0
Bahrain 1997 35-54 years 96825 53562 0 0
Bahrain 1997 5-14 years 57431 55390 0 0
Bahrain 1997 55-74 years 16035 13826 0 0
Bahrain 1997 75+ years 2355 2250 0 0
Bahrain 1998 15-24 years 56377 48147 0 0
Bahrain 1998 25-34 years 80849 51596 0 0
Bahrain 1998 35-54 years 101013 56917 0 0
Bahrain 1998 5-14 years 58935 56626 0 0
Bahrain 1998 55-74 years 16409 14404 0 0
Bahrain 1998 75+ years 2423 2312 0 0
Bahrain 1999 25-34 years 82791 52339 8 1
Bahrain 1999 35-54 years 105903 60568 4 1
Bahrain 1999 15-24 years 57914 49631 2 1
Bahrain 1999 5-14 years 61554 58374 0 0
Bahrain 1999 55-74 years 16810 15017 0 0
Bahrain 1999 75+ years 2522 2339 0 0
Bahrain 2000 25-34 years 86231 53263 16 1
Bahrain 2000 35-54 years 112060 64617 10 0
Bahrain 2000 15-24 years 59208 50968 2 0
Bahrain 2000 5-14 years 65512 60791 0 0
Bahrain 2000 55-74 years 17346 15556 0 0
Bahrain 2000 75+ years 2648 2394 0 0
Bahrain 2001 25-34 years 90920 55961 8 1
Bahrain 2001 35-54 years 118151 67856 8 1
Bahrain 2001 55-74 years 17917 15791 1 0
Bahrain 2001 15-24 years 68309 54497 2 1
Bahrain 2001 5-14 years 67127 62316 1 0
Bahrain 2001 75+ years 2694 2544 0 0
Bahrain 2002 25-34 years 97288 59350 7 1
Bahrain 2002 35-54 years 126500 71376 9 4
Bahrain 2002 15-24 years 76442 57599 2 2
Bahrain 2002 5-14 years 68596 63714 0 0
Bahrain 2002 55-74 years 18646 16006 0 0
Bahrain 2002 75+ years 2755 2738 0 0
Bahrain 2003 35-54 years 137287 75430 17 0
Bahrain 2003 25-34 years 105799 63380 9 5
Bahrain 2003 15-24 years 83143 60298 3 0
Bahrain 2003 5-14 years 70584 65298 0 0
Bahrain 2003 55-74 years 19574 16343 0 0
Bahrain 2003 75+ years 2831 2946 0 0
Bahrain 2004 55-74 years 20717 16975 4 0
Bahrain 2004 25-34 years 116777 68052 19 3
Bahrain 2004 35-54 years 150835 80288 10 3
Bahrain 2004 15-24 years 88431 62699 2 1
Bahrain 2004 5-14 years 73422 67313 0 0
Bahrain 2004 75+ years 2924 3126 0 0
Bahrain 2005 25-34 years 142375 73758 15 3
Bahrain 2005 35-54 years 174785 92102 14 2
Bahrain 2005 55-74 years 27261 19983 2 0
Bahrain 2005 15-24 years 79928 64792 4 2
Bahrain 2005 5-14 years 69742 66362 0 0
Bahrain 2005 75+ years 4063 4389 0 0
Bahrain 2006 25-34 years 158656 80271 12 5
Bahrain 2006 15-24 years 86090 69020 1 3
Bahrain 2006 55-74 years 29960 21363 1 0
Bahrain 2006 35-54 years 192597 100668 3 3
Bahrain 2006 5-14 years 71974 68590 0 0
Bahrain 2006 75+ years 4439 4915 0 0
Bahrain 2007 25-34 years 176938 87472 14 3
Bahrain 2007 35-54 years 212426 110173 13 2
Bahrain 2007 55-74 years 32965 22872 2 0
Bahrain 2007 15-24 years 92875 73617 5 1
Bahrain 2007 5-14 years 74222 70861 0 0
Bahrain 2007 75+ years 4852 5497 0 0
Bahrain 2008 25-34 years 196832 93449 21 2
Bahrain 2008 35-54 years 225866 115468 13 3
Bahrain 2008 15-24 years 96458 76246 0 3
Bahrain 2008 55-74 years 35879 24495 1 0
Bahrain 2008 5-14 years 76617 72955 0 0
Bahrain 2008 75+ years 4654 5191 0 0
Bahrain 2009 35-54 years 242558 118835 14 2
Bahrain 2009 25-34 years 222448 99580 9 4
Bahrain 2009 15-24 years 102256 80832 4 0
Bahrain 2009 55-74 years 38002 25915 1 0
Bahrain 2009 5-14 years 79180 75122 2 0
Bahrain 2009 75+ years 4091 4103 0 0
Bahrain 2010 35-54 years 252232 125709 8 2
Bahrain 2010 25-34 years 235591 104368 6 1
Bahrain 2010 55-74 years 41999 27834 1 0
Bahrain 2010 15-24 years 103388 80571 0 1

Traditional: plot data separately for each subpopulation

plot(df_wide$population_male, df_wide$suicides_no_male, col="red")
points(df_wide$population_female, df_wide$suicides_no_female, col="blue")

GG way uses long format

country year age sex suicides_no population
Bahrain 1985 25-34 years male 7 67600
Bahrain 1985 35-54 years male 3 49700
Bahrain 1985 35-54 years female 1 26900
Bahrain 1985 15-24 years female 0 37800
Bahrain 1985 25-34 years female 0 27600
Bahrain 1985 5-14 years female 0 41400
Bahrain 1985 55-74 years female 0 8700
Bahrain 1985 75+ years female 0 1500
Bahrain 1985 15-24 years male 0 49700
Bahrain 1985 5-14 years male 0 42000
Bahrain 1985 55-74 years male 0 11600
Bahrain 1985 75+ years male 0 1400
Bahrain 1987 55-74 years male 1 11800
Bahrain 1987 35-54 years male 3 47700
Bahrain 1987 25-34 years male 3 63900
Bahrain 1987 25-34 years female 1 27500
Bahrain 1987 15-24 years female 1 38600
Bahrain 1987 35-54 years female 0 27300
Bahrain 1987 5-14 years female 0 42500
Bahrain 1987 55-74 years female 0 8900
Bahrain 1987 75+ years female 0 1500
Bahrain 1987 15-24 years male 0 49400
Bahrain 1987 5-14 years male 0 43200
Bahrain 1987 75+ years male 0 1600
Bahrain 1988 25-34 years male 8 61600
Bahrain 1988 35-54 years male 4 63500
Bahrain 1988 35-54 years female 1 30900
Bahrain 1988 15-24 years male 1 44100
Bahrain 1988 15-24 years female 0 38300
Bahrain 1988 25-34 years female 0 32000
Bahrain 1988 5-14 years female 0 45400
Bahrain 1988 55-74 years female 0 10300
Bahrain 1988 75+ years female 0 1400
Bahrain 1988 5-14 years male 0 46500
Bahrain 1988 55-74 years male 0 14400
Bahrain 1988 75+ years male 0 1300
Bahrain 1997 15-24 years female 0 46573
Bahrain 1997 25-34 years female 0 50841
Bahrain 1997 35-54 years female 0 53562
Bahrain 1997 5-14 years female 0 55390
Bahrain 1997 55-74 years female 0 13826
Bahrain 1997 75+ years female 0 2250
Bahrain 1997 15-24 years male 0 54369
Bahrain 1997 25-34 years male 0 80297
Bahrain 1997 35-54 years male 0 96825
Bahrain 1997 5-14 years male 0 57431
Bahrain 1997 55-74 years male 0 16035
Bahrain 1997 75+ years male 0 2355
Bahrain 1998 15-24 years female 0 48147
Bahrain 1998 25-34 years female 0 51596
Bahrain 1998 35-54 years female 0 56917
Bahrain 1998 5-14 years female 0 56626
Bahrain 1998 55-74 years female 0 14404
Bahrain 1998 75+ years female 0 2312
Bahrain 1998 15-24 years male 0 56377
Bahrain 1998 25-34 years male 0 80849
Bahrain 1998 35-54 years male 0 101013
Bahrain 1998 5-14 years male 0 58935
Bahrain 1998 55-74 years male 0 16409
Bahrain 1998 75+ years male 0 2423
Bahrain 1999 25-34 years male 8 82791
Bahrain 1999 35-54 years male 4 105903
Bahrain 1999 15-24 years male 2 57914
Bahrain 1999 15-24 years female 1 49631
Bahrain 1999 25-34 years female 1 52339
Bahrain 1999 35-54 years female 1 60568
Bahrain 1999 5-14 years female 0 58374
Bahrain 1999 55-74 years female 0 15017
Bahrain 1999 75+ years female 0 2339
Bahrain 1999 5-14 years male 0 61554
Bahrain 1999 55-74 years male 0 16810
Bahrain 1999 75+ years male 0 2522
Bahrain 2000 25-34 years male 16 86231
Bahrain 2000 35-54 years male 10 112060
Bahrain 2000 15-24 years male 2 59208
Bahrain 2000 25-34 years female 1 53263
Bahrain 2000 15-24 years female 0 50968
Bahrain 2000 35-54 years female 0 64617
Bahrain 2000 5-14 years female 0 60791
Bahrain 2000 55-74 years female 0 15556
Bahrain 2000 75+ years female 0 2394
Bahrain 2000 5-14 years male 0 65512
Bahrain 2000 55-74 years male 0 17346
Bahrain 2000 75+ years male 0 2648
Bahrain 2001 25-34 years male 8 90920
Bahrain 2001 35-54 years male 8 118151
Bahrain 2001 55-74 years male 1 17917
Bahrain 2001 15-24 years male 2 68309
Bahrain 2001 15-24 years female 1 54497
Bahrain 2001 25-34 years female 1 55961
Bahrain 2001 5-14 years male 1 67127
Bahrain 2001 35-54 years female 1 67856
Bahrain 2001 5-14 years female 0 62316
Bahrain 2001 55-74 years female 0 15791
Bahrain 2001 75+ years female 0 2544
Bahrain 2001 75+ years male 0 2694
Bahrain 2002 25-34 years male 7 97288
Bahrain 2002 35-54 years male 9 126500
Bahrain 2002 35-54 years female 4 71376
Bahrain 2002 15-24 years female 2 57599
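
Getting from the wide table to the long one is a reshaping step. The sketch below shows one possible way, assuming the tidyr package is installed; the name df_wide follows the earlier plotting snippet, and only two rows of the 1985 data are typed in to keep it short.

```r
library(tidyr)

# Two rows in the wide layout shown earlier (values copied from the 1985 data).
df_wide <- data.frame(
  country = "Bahrain",
  year = 1985,
  age = c("25-34 years", "35-54 years"),
  population_male = c(67600, 49700),
  population_female = c(27600, 26900),
  suicides_no_male = c(7, 3),
  suicides_no_female = c(0, 1)
)

# Split each measurement column name into "<variable>_<sex>".
# ".value" keeps the first part as a column name (population, suicides_no);
# the second part becomes the value of the new `sex` column.
df_long <- pivot_longer(
  df_wide,
  cols = -c(country, year, age),
  names_to = c(".value", "sex"),
  names_pattern = "^(.*)_(male|female)$"
)

print(df_long)
```

Each wide row becomes two long rows, one per sex, which is the layout that ggplot expects.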

GG: associate an extra aesthetic (“colour”) with data points

ggplot(df, aes(x=population, y=suicides_no, colour=sex)) +
  geom_point()

Agile data visualisation with GG

Plots operate on aesthetic mappings

aes(x=population, y=suicides_no, colour=sex)

  • is an example of an aesthetic mapping: it associates aesthetics with values for each of your data points
  • here it associates:
    • the horizontal position of data points with population
    • the vertical position with the suicide count
    • the colour of points with sex
  • many other aesthetics, such as shape and size, are possible
  • geoms (see later) control how aesthetics are displayed
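
To see what an aesthetic mapping produces, it can help to look at the values ggplot2 computes for each point. The following is a minimal sketch, assuming ggplot2 is installed; df_small is an illustrative name holding four rows copied from the 1985 data, and shape is added as a second aesthetic.

```r
library(ggplot2)

# Four rows copied from the 1985 Bahrain data shown earlier.
df_small <- data.frame(
  age = c("25-34 years", "35-54 years", "25-34 years", "35-54 years"),
  sex = c("male", "male", "female", "female"),
  suicides_no = c(7, 3, 0, 1),
  population = c(67600, 49700, 27600, 26900)
)

# Position, colour and shape are all driven by columns of the data frame.
p <- ggplot(df_small, aes(x = population, y = suicides_no,
                          colour = sex, shape = age)) +
  geom_point(size = 3)

# layer_data() returns the values computed for the first layer:
# one row per observation, with the colour and shape chosen by the scales.
computed <- layer_data(p)
print(computed[, c("x", "y", "colour", "shape")])
```

Swapping `shape = age` for another column changes the plot without touching the data frame, which is the point of mapping rather than setting aesthetics.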

Colour by country: traditionally annoying

ggplot(df, aes(x=population, y=suicides_no, colour=country)) +
  geom_point()

Shape by country: traditionally annoying

ggplot(df, aes(x=population, y=suicides_no, shape=country)) +
  geom_point()

Regressions by country

ggplot(df, aes(x=population, y=suicides_no, colour=country)) +
  geom_point(alpha=0.3) + geom_smooth(method="lm", se=FALSE)

Overall regression

ggplot(df, aes(x=population, y=suicides_no)) +
  geom_point(alpha=0.3, aes(colour=country)) +
  geom_smooth(method="lm", se=FALSE, colour="black")

What is a geom?

  • geom_point and geom_smooth are both geometrical elements (“geoms”) used to represent data
  • here they both take the same x and y variables and use them to produce different visualisations
  • other geom examples are geom_line, geom_histogram, geom_violin, geom_rect

Simple geom example

x y label
1 2 a
2 4 b
3 10 c
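
The slides that follow use data_df without constructing it. A minimal sketch of one way to build it from the table above, assuming ggplot2 is installed, is given here; it also checks that geom_point and geom_smooth start from the same three rows but draw different things.

```r
library(ggplot2)

# The three-row table above as a data frame.
data_df <- data.frame(
  x = c(1, 2, 3),
  y = c(2, 4, 10),
  label = c("a", "b", "c")
)

# Two geoms sharing the same x and y mapping.
p <- ggplot(data_df, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x)

# Layer 1 holds the raw points; layer 2 holds the fitted line,
# evaluated on a finer grid of x values.
points_layer <- layer_data(p, 1)
smooth_layer <- layer_data(p, 2)

print(nrow(points_layer))
print(nrow(smooth_layer))
```

The point layer keeps one row per observation, while the smooth layer replaces them with a grid of fitted values.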

point

ggplot(data_df, aes(x, y, label = label)) +
  theme(text=element_text(size=14)) +
  geom_point()

text

ggplot(data_df, aes(x, y, label = label)) +
  theme(text=element_text(size=14)) +
  geom_text(size=18)

col

ggplot(data_df, aes(x, y, label = label)) +
  theme(text=element_text(size=14)) +
  geom_col()

line

ggplot(data_df, aes(x, y, label = label)) +
  theme(text=element_text(size=14)) +
  geom_line()

line and points

ggplot(data_df, aes(x, y, label = label)) +
  theme(text=element_text(size=14)) +
  geom_line() + geom_point()

line and jittered points

ggplot(data_df, aes(x, y, label = label)) +
  theme(text=element_text(size=14)) +
  geom_line() + geom_jitter()

regression line and jitter

ggplot(data_df, aes(x, y, label = label)) +
  theme(text=element_text(size=14)) +
  geom_smooth(method="lm", se=FALSE, formula = y~x) + geom_jitter()

polygon

ggplot(data_df, aes(x, y, label = label)) +
  theme(text=element_text(size=14)) +
  geom_polygon()

Order of layering

ggplot(df, aes(x=population, y=suicides_no)) +
  geom_point(alpha=1, aes(colour=country)) +
  geom_smooth(method="lm", se=FALSE, colour="black")

Order of layering

ggplot(df, aes(x=population, y=suicides_no)) +
  geom_smooth(method="lm", se=FALSE, colour="black") +
  geom_point(alpha=1, aes(colour=country))

Change axis scales

ggplot(df, aes(x=population, y=suicides_no)) +
  geom_point(alpha=0.3, aes(colour=country)) +
  scale_x_sqrt() + scale_y_sqrt()

Boxplots

ggplot(df, aes(x=as.factor(year), y=suicides_no)) +
  geom_boxplot()

Boxplots flipped

ggplot(df, aes(x=as.factor(year), y=suicides_no)) +
  geom_boxplot() +
  coord_flip()

Separate by age group

ggplot(df, aes(x=year, y=suicides_no, colour=country, shape=age)) +
  geom_point(alpha=0.8)

Facet panelling

  • one way to add variables is with aesthetics
  • another way, especially useful for categorical variables, is to split plots into facets
  • each facet represents a plot of a subset of your data
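
The idea that each facet is a plot of a subset can be checked directly. This is a minimal sketch, assuming ggplot2 is installed; df_small is an illustrative name and holds the male counts for two age groups in 1985, 1987 and 1988, copied from the table shown earlier.

```r
library(ggplot2)

# Male suicide counts for two age groups, copied from the Bahrain data.
df_small <- data.frame(
  year = c(1985, 1985, 1987, 1987, 1988, 1988),
  age = rep(c("25-34 years", "35-54 years"), times = 3),
  suicides_no = c(7, 3, 3, 3, 8, 4)
)

# One panel per age group.
p <- ggplot(df_small, aes(x = year, y = suicides_no)) +
  geom_point() +
  facet_wrap(~age)

# PANEL records which facet each point was assigned to,
# so tabulating it shows how the rows were split.
panel_counts <- table(layer_data(p)$PANEL)
print(panel_counts)
```

Each panel receives only the rows for its own age group, so the six rows are split three and three.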

Facet by country

ggplot(df, aes(x=year, y=suicides_no, shape=age)) +
  geom_point() + facet_wrap(~country)

Age and country panels

ggplot(df, aes(x=year, y=suicides_no)) +
  geom_point() + facet_grid(vars(country), vars(age), scales="free")

Add in sex

ggplot(df, aes(x=year, y=suicides_no, colour=sex)) +
  geom_point() + facet_grid(vars(country), vars(age), scales="free")

Change geom to line

ggplot(df, aes(x=year, y=suicides_no, colour=sex)) +
  geom_line() + facet_grid(vars(country), vars(age), scales="free")

Adding linear regressions

ggplot(df, aes(x=year, y=suicides_no, colour=sex)) +
  geom_line() + facet_grid(vars(country), vars(age), scales="free") +
  geom_smooth(method="lm", se=FALSE)

Conclusions

Benefits of GG

  • agile data exploration: keep tinkering till it looks right
  • aesthetics allow layering of hierarchies of features
  • geoms handle a lot, so there is less to get wrong
  • dominates traditional graphics for rich datasets

Packages

  • R: ggplot2
  • Python: plotnine (essentially ggplot2) and Plotly

Where to learn more

[image]